Biostat 203B Homework 3

Due Feb 23 @ 11:59PM

Author

Kathy Hoang and 506333118

Display machine information for reproducibility:

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.2    fastmap_1.1.1     cli_3.6.2        
 [5] tools_4.3.2       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.8       
 [9] rmarkdown_2.25    knitr_1.45        jsonlite_1.8.8    xfun_0.41        
[13] digest_0.6.34     rlang_1.1.3       evaluate_0.23    

Load necessary libraries (you can add more as needed).

library(arrow)

Attaching package: 'arrow'
The following object is masked from 'package:utils':

    timestamp
library(memuse)
library(pryr)
library(R.utils)
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.26.0 (2024-01-24 05:12:50 UTC) successfully loaded. See ?R.oo for help.

Attaching package: 'R.oo'
The following object is masked from 'package:R.methodsS3':

    throw
The following objects are masked from 'package:methods':

    getClasses, getMethods
The following objects are masked from 'package:base':

    attach, detach, load, save
R.utils v2.12.3 (2023-11-18 01:00:02 UTC) successfully loaded. See ?R.utils for help.

Attaching package: 'R.utils'
The following object is masked from 'package:arrow':

    timestamp
The following object is masked from 'package:utils':

    timestamp
The following objects are masked from 'package:base':

    cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::compose()      masks pryr::compose()
✖ lubridate::duration() masks arrow::duration()
✖ tidyr::extract()      masks R.utils::extract()
✖ dplyr::filter()       masks stats::filter()
✖ dplyr::lag()          masks stats::lag()
✖ purrr::partial()      masks pryr::partial()
✖ dplyr::where()        masks pryr::where()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Run this and add R_MAX_VSIZE=100Gb to avoid vector memory issue.

library(usethis) 
usethis::edit_r_environ()

Display your machine memory.

memuse::Sys.meminfo()
Totalram:  16.000 GiB 
Freeram:    6.126 GiB 
rm(list = ls())

In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.

Q1. Visualizing patient trajectory

Visualizing a patient’s encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient’s ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.

Q1.1 ADT history

A patient’s ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with subject_id 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient’s demographic information and the subtitle shows top 3 diagnoses.

Do a similar visualization for the patient with subject_id 10013310 using ggplot.

Hint: We need to pull information from data files patients.csv.gz, admissions.csv.gz, transfers.csv.gz, labevents.csv.gz, procedures_icd.csv.gz, diagnoses_icd.csv.gz, d_icd_procedures.csv.gz, and d_icd_diagnoses.csv.gz. For the big file labevents.csv.gz, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder labevents_pq available at the current working directory hw3, for example, by a symbolic link. Make your code reproducible.

Answer

sid <- 10013310
# example in class
# sid <- 10001217
transfers_tbl <- read_csv("~/mimic/hosp/transfers.csv.gz", show_col_types = FALSE) |>
  filter(subject_id == sid)
transfers_tbl <- transfers_tbl |>
  filter(eventtype != "discharge", !is.na(outtime)) # to get rid of NAs
# filtering out missing values prevents it from appearing in the legend
# SET EVAL TO FALSE - this is to test ADT alone
transfers_tbl |>
  ggplot() +
  geom_segment(aes(
    x = intime,
    xend = outtime,
    y = "ADT", yend = "ADT", color = careunit,
    linewidth = str_detect(careunit, "(ICU|ICCU)")
  )) +
  # changed y and yend to "ADT"
  labs(
    x = "",
    y = "",
    title = "ADT history"
  ) +
  guides(linewidth = "none")

# look at missing values
sum(is.na(transfers_tbl$outtime))
# Created symbolic link labevents_pq in hw3 to labevents.parquet in mimic folder
labevents_tbl <- arrow::read_parquet("labevents_pq/part-0.parquet") |>
  filter(subject_id == sid)
patients_tbl <- read_csv("~/mimic/hosp/patients.csv.gz") |>
  filter(subject_id == sid)
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# SET EVAL TO FALSE - this is to test Lab events alone

labevents_tbl |>
  ggplot() +
  geom_point(shape = 3, aes(x = charttime, y = "lab")) +
  # change shape to cross = 3
  labs(
    x = "",
    y = "",
    title = "Lab events"
  )
procedures_tbl <- read_csv("~/mimic/hosp/procedures_icd.csv.gz") |>
  filter(subject_id == sid)
Rows: 669186 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): icd_code
dbl  (4): subject_id, hadm_id, seq_num, icd_version
date (1): chartdate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_icd_procedures_tbl <- read_csv("~/mimic/hosp/d_icd_procedures.csv.gz")
Rows: 85257 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(d_icd_procedures_tbl, 5)
# A tibble: 5 × 3
  icd_code icd_version long_title                                           
  <chr>          <dbl> <chr>                                                
1 0001               9 Therapeutic ultrasound of vessels of head and neck   
2 0002               9 Therapeutic ultrasound of heart                      
3 0003               9 Therapeutic ultrasound of peripheral vascular vessels
4 0009               9 Other therapeutic ultrasound                         
5 001               10 Central Nervous System and Cranial Nerves, Bypass    
# Merge procedures tables by icd code
procedures <- merge(procedures_tbl, d_icd_procedures_tbl, by = "icd_code")
head(procedures, 5)
  icd_code subject_id  hadm_id seq_num  chartdate icd_version.x icd_version.y
1  027034Z   10013310 27682188       1 2153-05-06            10            10
2  03CG3ZZ   10013310 22098926       1 2153-06-10            10            10
3  0DH63UZ   10013310 22098926       3 2153-07-15            10            10
4  3E05317   10013310 22098926       2 2153-06-10            10            10
5  3E0G76Z   10013310 22098926       4 2153-06-11            10            10
                                                                                            long_title
1 Dilation of Coronary Artery, One Artery with Drug-eluting Intraluminal Device, Percutaneous Approach
2                                Extirpation of Matter from Intracranial Artery, Percutaneous Approach
3                                      Insertion of Feeding Device into Stomach, Percutaneous Approach
4                     Introduction of Other Thrombolytic into Peripheral Artery, Percutaneous Approach
5               Introduction of Nutritional Substance into Upper GI, Via Natural or Artificial Opening
# SET EVAL TO FALSE - this is to test procedures alone
procedures |>
  ggplot() +
  geom_point(aes(x = chartdate, y = "procedure", shape = long_title), position = position_jitter(width = 0.2, seed = 203)) +
  labs(
    x = "",
    y = "",
    title = "Procedures",
    shape = "Procedure"
  ) +
  # to change order of shapes to match example
  scale_shape_manual(values = c(1:9)) +
  # no longer need this line above, changed shape = long_title instead of = icu_code
  # 15 = square, 16 = circle, 17 = triangle
  # change legend of shape to icd_code
  theme(legend.position = "bottom")
diagnoses_tbl <- read_csv("~/mimic/hosp/diagnoses_icd.csv.gz") |>
  filter(subject_id == sid)
Rows: 4756326 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): icd_code
dbl (4): subject_id, hadm_id, seq_num, icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_icd_diagnoses_tbl <- read_csv("~/mimic/hosp/d_icd_diagnoses.csv.gz")
Rows: 109775 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(d_icd_procedures_tbl, 5)
# A tibble: 5 × 3
  icd_code icd_version long_title                                           
  <chr>          <dbl> <chr>                                                
1 0001               9 Therapeutic ultrasound of vessels of head and neck   
2 0002               9 Therapeutic ultrasound of heart                      
3 0003               9 Therapeutic ultrasound of peripheral vascular vessels
4 0009               9 Other therapeutic ultrasound                         
5 001               10 Central Nervous System and Cranial Nerves, Bypass    
# Merge diagnoses tables by icd code
diagnoses <- merge(diagnoses_tbl, d_icd_diagnoses_tbl, by = "icd_code")
head(procedures, 5)
  icd_code subject_id  hadm_id seq_num  chartdate icd_version.x icd_version.y
1  027034Z   10013310 27682188       1 2153-05-06            10            10
2  03CG3ZZ   10013310 22098926       1 2153-06-10            10            10
3  0DH63UZ   10013310 22098926       3 2153-07-15            10            10
4  3E05317   10013310 22098926       2 2153-06-10            10            10
5  3E0G76Z   10013310 22098926       4 2153-06-11            10            10
                                                                                            long_title
1 Dilation of Coronary Artery, One Artery with Drug-eluting Intraluminal Device, Percutaneous Approach
2                                Extirpation of Matter from Intracranial Artery, Percutaneous Approach
3                                      Insertion of Feeding Device into Stomach, Percutaneous Approach
4                     Introduction of Other Thrombolytic into Peripheral Artery, Percutaneous Approach
5               Introduction of Nutritional Substance into Upper GI, Via Natural or Artificial Opening
# Top 3 Diagnosis in merged diagnoses table
top_3_diagnosis <- diagnoses |>
  count(long_title) |> # count frequency of diagnosis
  arrange(desc(n)) |> # order by frequency
  head(3) |>
  pull(long_title)

top_3_diagnosis
[1] "Acute on chronic systolic (congestive) heart failure"
[2] "Hyperlipidemia, unspecified"                         
[3] "Long term (current) use of insulin"                  
patients_tbl <- read_csv("~/mimic/hosp/patients.csv.gz") |>
  filter(subject_id == sid)
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# gender, anchor_age

admissions_tbl <- read_csv("~/mimic/hosp/admissions.csv.gz") |>
  filter(subject_id == sid)
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# race

# Merge patients and admissions tables
patient_info_title <- merge(patients_tbl, admissions_tbl, by = "subject_id")
patient_info_title
  subject_id gender anchor_age anchor_year anchor_year_group        dod
1   10013310      F         70        2153       2017 - 2019 2153-11-19
2   10013310      F         70        2153       2017 - 2019 2153-11-19
3   10013310      F         70        2153       2017 - 2019 2153-11-19
   hadm_id           admittime           dischtime deathtime    admission_type
1 21243435 2153-05-26 14:18:00 2153-06-05 19:30:00      <NA> OBSERVATION ADMIT
2 22098926 2153-06-10 11:55:00 2153-07-21 18:00:00      <NA> OBSERVATION ADMIT
3 27682188 2153-05-06 18:03:00 2153-05-13 13:45:00      <NA>            URGENT
  admit_provider_id        admission_location       discharge_location
1            P78TNY INFORMATION NOT AVAILABLE         HOME HEALTH CARE
2            P09IS0 INFORMATION NOT AVAILABLE SKILLED NURSING FACILITY
3            P89ZCW    TRANSFER FROM HOSPITAL         HOME HEALTH CARE
  insurance language marital_status          race           edregtime
1  Medicare        ?         SINGLE BLACK/AFRICAN 2153-05-26 08:56:00
2  Medicare        ?         SINGLE BLACK/AFRICAN 2153-06-10 10:40:00
3  Medicare        ?         SINGLE BLACK/AFRICAN 2153-05-06 10:21:00
            edouttime hospital_expire_flag
1 2153-05-26 16:33:00                    0
2 2153-06-10 11:25:00                    0
3 2153-05-06 18:28:00                    0
race <- patient_info_title$race[1]
gender <- patient_info_title$gender[1]
age <- patient_info_title$anchor_age[1]
# Type needs to be the same on x-axis

# check data types of intime, chartime, and chartdate
transfers_tbl %>%
  select(intime, outtime) %>%
  map(class)
$intime
[1] "POSIXct" "POSIXt" 

$outtime
[1] "POSIXct" "POSIXt" 
labevents_tbl %>%
  select(charttime) %>%
  map(class)
$charttime
[1] "POSIXct" "POSIXt" 
procedures %>%
  select(chartdate) %>%
  map(class)
$chartdate
[1] "Date"
# Need to change chartdate to POSIXct type to be consistent and avoid error plotting on same plot!
procedures <- procedures %>%
  mutate(chartdate = as.POSIXct(chartdate, format = "%Y-%m-%d %H:%M:%S"))

procedures %>%
  select(chartdate) %>%
  map(class)
$chartdate
[1] "POSIXct" "POSIXt" 

1.1 Plot

# Combine all the plots
ggplot() +
  geom_segment(data = transfers_tbl, aes(
    x = intime,
    xend = outtime,
    y = "ADT",
    yend = "ADT",
    color = careunit,
    linewidth = str_detect(careunit, "(ICU|ICCU)")
  )) +
  geom_point(data = labevents_tbl, aes(x = charttime, y = "Lab"), shape = 3) +
  geom_point(data = procedures %>% mutate(long_title = str_sub(long_title, 1, 20)), aes(x = chartdate, y = "Procedure", shape = long_title), position = position_jitter(width = 0.2, seed = 11)) +
  scale_shape_manual(values = c(1:10)) +
  labs(
    x = "Calendar Time",
    y = "",
    title = paste("Patient", sid, gender, age, race, sep = ","),
    shape = "Procedure",
    color = "Care Unit",
    linewidth = "ICU/CCU",
    subtitle = paste(top_3_diagnosis, collapse = "\n")
  ) +
  scale_y_discrete(limits = c("Procedure", "Lab", "ADT")) +
  theme_minimal() +
  # collpase on \n to print on separate lines
  theme(
    legend.box = "vertical",
    legend.position = "bottom",
    legend.justification = "left",
    legend.key.size = unit(0.5, "lines"),
    legend.key.width = unit(0.5, "lines"),
    legend.margin = margin(0, 50, 0, -50)
  ) +
  guides(
    shape = guide_legend(order = 1), # Order of shape legend
    color = guide_legend(order = 2), # Order of color legend
    linewidth = "none"
  ) # Order of size legend
Warning: Using linewidth for a discrete variable is not advised.

Q1.2 ICU stays

ICU stays are a subset of ADT history. This figure shows the vitals of the patient 10001217 during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.

Do a similar visualization for the patient 10013310.

Answer

sid <- 10013310
# example in class
# sid <- 10001217

# load icu stays dataset
icustays <- read_csv("~/mimic/icu/icustays.csv.gz") |>
  filter(subject_id == sid)
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
icustays
# A tibble: 2 × 8
  subject_id  hadm_id  stay_id first_careunit  last_careunit intime             
       <dbl>    <dbl>    <dbl> <chr>           <chr>         <dttm>             
1   10013310 22098926 32769810 Neuro Surgical… Neuro Interm… 2153-06-10 11:55:42
2   10013310 27682188 31203589 Coronary Care … Coronary Car… 2153-05-06 18:28:00
# ℹ 2 more variables: outtime <dttm>, los <dbl>
# Created symbolic link chartevents_pq in hw3 to chartevents.parquet in mimic folder
# load chartevents dataset
chartevents <- arrow::open_dataset("chartevents_pq/part-0.parquet", format = "parquet") |>
  filter(subject_id == sid) |>
  filter(itemid %in% c(220045, 220181, 220179, 223761, 220210)) |> # filter for vitals
  collect() # what collect does is it takes the data from the arrow table and puts it into a tibble
chartevents
# A tibble: 549 × 5
   subject_id itemid charttime           value  stay_id
        <int>  <int> <dttm>              <chr>    <int>
 1   10013310 223761 2153-06-11 01:00:00 98.8  32769810
 2   10013310 220045 2153-06-11 02:00:00 113   32769810
 3   10013310 220210 2153-06-11 02:00:00 26    32769810
 4   10013310 220179 2153-06-11 02:02:00 131   32769810
 5   10013310 220181 2153-06-11 02:02:00 79    32769810
 6   10013310 220045 2153-06-12 00:00:00 121   32769810
 7   10013310 220210 2153-06-12 00:00:00 25    32769810
 8   10013310 220179 2153-06-12 00:03:00 134   32769810
 9   10013310 220181 2153-06-12 00:03:00 85    32769810
10   10013310 223761 2153-06-12 01:00:00 99    32769810
# ℹ 539 more rows
merged_stay <- inner_join(icustays, chartevents, by = "stay_id") # merge chartevents and icustays by stay_id
merged_stay
# A tibble: 549 × 12
   subject_id.x hadm_id stay_id first_careunit last_careunit intime             
          <dbl>   <dbl>   <dbl> <chr>          <chr>         <dttm>             
 1     10013310  2.21e7  3.28e7 Neuro Surgica… Neuro Interm… 2153-06-10 11:55:42
 2     10013310  2.21e7  3.28e7 Neuro Surgica… Neuro Interm… 2153-06-10 11:55:42
 3     10013310  2.21e7  3.28e7 Neuro Surgica… Neuro Interm… 2153-06-10 11:55:42
 4     10013310  2.21e7  3.28e7 Neuro Surgica… Neuro Interm… 2153-06-10 11:55:42
 5     10013310  2.21e7  3.28e7 Neuro Surgica… Neuro Interm… 2153-06-10 11:55:42
 6     10013310  2.21e7  3.28e7 Neuro Surgica… Neuro Interm… 2153-06-10 11:55:42
 7     10013310  2.21e7  3.28e7 Neuro Surgica… Neuro Interm… 2153-06-10 11:55:42
 8     10013310  2.21e7  3.28e7 Neuro Surgica… Neuro Interm… 2153-06-10 11:55:42
 9     10013310  2.21e7  3.28e7 Neuro Surgica… Neuro Interm… 2153-06-10 11:55:42
10     10013310  2.21e7  3.28e7 Neuro Surgica… Neuro Interm… 2153-06-10 11:55:42
# ℹ 539 more rows
# ℹ 6 more variables: outtime <dttm>, los <dbl>, subject_id.y <int>,
#   itemid <int>, charttime <dttm>, value <chr>
# Change charttime to POSIXct type
merged_stay <- merged_stay |>
  mutate(charttime = as.POSIXct(charttime, format = "%Y-%m-%d %H:%M:%S")) |>
  mutate(valuenum = as.numeric(value))

# Change itemid to vitals
merged_stay <- merged_stay |>
  mutate(vital_abbrev = case_when(
    itemid == 220045 ~ "HR",
    itemid == 220181 ~ "NBPd",
    itemid == 220179 ~ "NBPs",
    itemid == 223761 ~ "Temperature F",
    itemid == 220210 ~ "RR"
  ))
merged_stay
# A tibble: 549 × 14
   subject_id.x hadm_id stay_id first_careunit last_careunit intime             
          <dbl>   <dbl>   <dbl> <chr>          <chr>         <dttm>             
 1     10013310  2.21e7  3.28e7 Neuro Surgica… Neuro Interm… 2153-06-10 11:55:42
 2     10013310  2.21e7  3.28e7 Neuro Surgica… Neuro Interm… 2153-06-10 11:55:42
 3     10013310  2.21e7  3.28e7 Neuro Surgica… Neuro Interm… 2153-06-10 11:55:42
 4     10013310  2.21e7  3.28e7 Neuro Surgica… Neuro Interm… 2153-06-10 11:55:42
 5     10013310  2.21e7  3.28e7 Neuro Surgica… Neuro Interm… 2153-06-10 11:55:42
 6     10013310  2.21e7  3.28e7 Neuro Surgica… Neuro Interm… 2153-06-10 11:55:42
 7     10013310  2.21e7  3.28e7 Neuro Surgica… Neuro Interm… 2153-06-10 11:55:42
 8     10013310  2.21e7  3.28e7 Neuro Surgica… Neuro Interm… 2153-06-10 11:55:42
 9     10013310  2.21e7  3.28e7 Neuro Surgica… Neuro Interm… 2153-06-10 11:55:42
10     10013310  2.21e7  3.28e7 Neuro Surgica… Neuro Interm… 2153-06-10 11:55:42
# ℹ 539 more rows
# ℹ 8 more variables: outtime <dttm>, los <dbl>, subject_id.y <int>,
#   itemid <int>, charttime <dttm>, value <chr>, valuenum <dbl>,
#   vital_abbrev <chr>

1.2 Plot

# Plot
icustays |>
  ggplot() +
  geom_point(data = merged_stay, aes(x = charttime, y = valuenum, color = vital_abbrev), size = 1) +
  geom_line(data = merged_stay, aes(x = charttime, y = valuenum, color = vital_abbrev)) +
  facet_grid(rows = vars(vital_abbrev), cols = vars(stay_id), scales = "free") +
  # Use vars() to supply variables from the dataset: https://ggplot2.tidyverse.org/reference/facet_grid.html
  labs(
    x = "", # Calendar Time
    y = "", # Value of the vital
    title = paste("Patient", sid, "ICU Stays - Vitals", sep = ","),
    color = "Vital"
  ) +
  scale_x_datetime(date_labels = "%b-%d %H", guide = guide_axis(n.dodge = 2)) +
  # took out day breaks to solve overcrowded x-axis
  theme_minimal() +
  theme(
    legend.position = "none", # no legend
    panel.spacing = unit(0.5, "lines"),
    panel.border = element_rect(colour = "grey", fill = NA, size = 0.5),
    strip.background = element_rect(fill = "grey", colour = "grey"),
    strip.text = element_text(colour = "white", size = 6)
  )

# Source staggering: https://community.rstudio.com/t/x-axis-labels-overlap-want-to-rotate-labels-45/63800/3

Q2. ICU stays

icustays.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Units (ICU) stays. The first 10 lines are

zcat < ~/mimic/icu/icustays.csv.gz | head
subject_id,hadm_id,stay_id,first_careunit,last_careunit,intime,outtime,los
10000032,29079034,39553978,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2180-07-23 14:00:00,2180-07-23 23:50:47,0.4102662037037037
10000980,26913865,39765666,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2189-06-27 08:42:00,2189-06-27 20:38:27,0.4975347222222222
10001217,24597018,37067082,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-11-20 19:18:02,2157-11-21 22:08:00,1.1180324074074075
10001217,27703517,34592300,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-12-19 15:42:24,2157-12-20 14:27:41,0.9481134259259258
10001725,25563031,31205490,Medical/Surgical Intensive Care Unit (MICU/SICU),Medical/Surgical Intensive Care Unit (MICU/SICU),2110-04-11 15:52:22,2110-04-12 23:59:56,1.338587962962963
10001884,26184834,37510196,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-01-11 04:20:05,2131-01-20 08:27:30,9.171817129629629
10002013,23581541,39060235,Cardiac Vascular Intensive Care Unit (CVICU),Cardiac Vascular Intensive Care Unit (CVICU),2160-05-18 10:00:53,2160-05-19 17:33:33,1.3143518518518518
10002155,20345487,32358465,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-03-09 21:33:00,2131-03-10 18:09:21,0.8585763888888889
10002155,23822395,33685454,Coronary Care Unit (CCU),Coronary Care Unit (CCU),2129-08-04 12:45:00,2129-08-10 17:02:38,6.178912037037037

Q2.1 Ingestion

Import icustays.csv.gz as a tibble icustays_tble.

Answer

icustays_tble <- read_csv("~/mimic/icu/icustays.csv.gz")
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(icustays_tble, 5)
# A tibble: 5 × 8
  subject_id  hadm_id  stay_id first_careunit  last_careunit intime             
       <dbl>    <dbl>    <dbl> <chr>           <chr>         <dttm>             
1   10000032 29079034 39553978 Medical Intens… Medical Inte… 2180-07-23 14:00:00
2   10000980 26913865 39765666 Medical Intens… Medical Inte… 2189-06-27 08:42:00
3   10001217 24597018 37067082 Surgical Inten… Surgical Int… 2157-11-20 19:18:02
4   10001217 27703517 34592300 Surgical Inten… Surgical Int… 2157-12-19 15:42:24
5   10001725 25563031 31205490 Medical/Surgic… Medical/Surg… 2110-04-11 15:52:22
# ℹ 2 more variables: outtime <dttm>, los <dbl>

Q2.2 Summary and visualization

How many unique subject_id? Can a subject_id have multiple ICU stays? Summarize the number of ICU stays per subject_id by graphs.

Answer

# How many unique `subject_id`?
icustays_tble %>%
  count(subject_id) %>%
  nrow()
[1] 50920
# Can a `subject_id` have multiple ICU stays?
icustays_tble %>%
  count(subject_id) %>% # count the num of subject_id
  filter(n > 1) %>% # that have more than 1 ICU stay
  nrow()
[1] 12448
icustays_tble %>%
  count(subject_id) %>%
  ggplot(aes(x = n)) +
  geom_bar() +
  labs(title = "Number of ICU Stays per Subject_id", x = "Number of ICU stays", y = "Number of subject_id")

There are 50920 unique subject_id in the icustays_tble.

Yes, a subject_id can have multiple ICU stays. There are 12448 subject_id with multiple ICU stays.

The graph shows the distribution of the number of ICU stays per subject_id. As depicted, the majority of subject_id have only one ICU stay. The next most common number of ICU stays is 2, followed by 3, 4, and 5. There are very few subject_id with more than 5 ICU stays.

Q3. admissions data

Information of the patients admitted into hospital is available in admissions.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/admissions/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/admissions.csv.gz | head
subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admit_provider_id,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag
10000032,22595853,2180-05-06 22:23:00,2180-05-07 17:15:00,,URGENT,P874LG,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,WIDOWED,WHITE,2180-05-06 19:17:00,2180-05-06 23:30:00,0
10000032,22841357,2180-06-26 18:27:00,2180-06-27 18:49:00,,EW EMER.,P09Q6Y,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-06-26 15:54:00,2180-06-26 21:31:00,0
10000032,25742920,2180-08-05 23:44:00,2180-08-07 17:50:00,,EW EMER.,P60CC5,EMERGENCY ROOM,HOSPICE,Medicaid,ENGLISH,WIDOWED,WHITE,2180-08-05 20:58:00,2180-08-06 01:44:00,0
10000032,29079034,2180-07-23 12:35:00,2180-07-25 17:55:00,,EW EMER.,P30KEH,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-07-23 05:54:00,2180-07-23 14:00:00,0
10000068,25022803,2160-03-03 23:16:00,2160-03-04 06:26:00,,EU OBSERVATION,P51VDL,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2160-03-03 21:55:00,2160-03-04 06:26:00,0
10000084,23052089,2160-11-21 01:56:00,2160-11-25 14:52:00,,EW EMER.,P6957U,WALK-IN/SELF REFERRAL,HOME HEALTH CARE,Medicare,ENGLISH,MARRIED,WHITE,2160-11-20 20:36:00,2160-11-21 03:20:00,0
10000084,29888819,2160-12-28 05:11:00,2160-12-28 16:07:00,,EU OBSERVATION,P63AD6,PHYSICIAN REFERRAL,,Medicare,ENGLISH,MARRIED,WHITE,2160-12-27 18:32:00,2160-12-28 16:07:00,0
10000108,27250926,2163-09-27 23:17:00,2163-09-28 09:04:00,,EU OBSERVATION,P38XXV,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2163-09-27 16:18:00,2163-09-28 09:04:00,0
10000117,22927623,2181-11-15 02:05:00,2181-11-15 14:52:00,,EU OBSERVATION,P2358X,EMERGENCY ROOM,,Other,ENGLISH,DIVORCED,WHITE,2181-11-14 21:51:00,2181-11-15 09:57:00,0

Q3.1 Ingestion

Import admissions.csv.gz as a tibble admissions_tble.

Answer

admissions_tble <- read_csv("~/mimic/hosp/admissions.csv.gz")
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(admissions_tble, 5)
# A tibble: 5 × 16
  subject_id hadm_id admittime           dischtime           deathtime
       <dbl>   <dbl> <dttm>              <dttm>              <dttm>   
1   10000032  2.26e7 2180-05-06 22:23:00 2180-05-07 17:15:00 NA       
2   10000032  2.28e7 2180-06-26 18:27:00 2180-06-27 18:49:00 NA       
3   10000032  2.57e7 2180-08-05 23:44:00 2180-08-07 17:50:00 NA       
4   10000032  2.91e7 2180-07-23 12:35:00 2180-07-25 17:55:00 NA       
5   10000068  2.50e7 2160-03-03 23:16:00 2160-03-04 06:26:00 NA       
# ℹ 11 more variables: admission_type <chr>, admit_provider_id <chr>,
#   admission_location <chr>, discharge_location <chr>, insurance <chr>,
#   language <chr>, marital_status <chr>, race <chr>, edregtime <dttm>,
#   edouttime <dttm>, hospital_expire_flag <dbl>

Q3.2 Summary and visualization

Summarize the following information by graphics and explain any patterns you see.

  • number of admissions per patient
  • admission hour (anything unusual?)
  • admission minute (anything unusual?)
  • length of hospital stay (from admission to discharge) (anything unusual?)

According to the MIMIC-IV documentation,

All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.

Answer

# number of admissions per patient
admissions_tble %>%
  count(subject_id) %>%
  ggplot(aes(x = n)) +
  geom_bar() +
  labs(title = "Number of Admissions per Subject_id", x = "Number of admissions", y = "Number of subject_id") +
  scale_x_continuous(limits = c(0, 30))

# ZOOM IN
# make x axis scale smaller (this will lose some data)
# admission hour (anything unusual?)
admissions_tble %>%
  ggplot(aes(x = hour(admittime))) +
  geom_bar() +
  labs(title = "Admission Hour", x = "Hour", y = "Count")

Yes there is something unusual about the admission hour. The majority of admissions occur at 00:00, which is unusual because that is midnight. There is also a random peak at 7am. This could be due to the date shifting to protect patient confidentiality. It is also possible that these were opening hours for the hospital.

# admission minute (anything unusual?)
admissions_tble %>%
  ggplot(aes(x = minute(admittime))) +
  geom_bar() +
  labs(title = "Admission Minute", x = "Minute", y = "Count")

Yes there is something unusual about the admission minute. There is a peak every 15 minutes, at 0, 15, 30, and 45, where 0 is the start of the hour. It is possible that this is when the data was recorded the most; perhaps, patient intake is recorded every 15 minutes. It could also be due to time shifting to the nearest quarter hour to protect patient confidentiality.

# length of hospital stay (from admission to discharge) (anything unusual?)

time_diff <- admissions_tble$dischtime - admissions_tble$admittime # in seconds
time_diff <- as.numeric(time_diff, units = "hours") # convert to hours

admissions_tble %>%
  mutate(length_of_stay = time_diff) %>%
  ggplot(aes(x = length_of_stay)) +
  geom_histogram(binwidth = 100) + # pick better binwidth warning
  labs(title = "Length of Hospital Stay", x = "Length of stay (hours)", y = "Count")

Q4. patients data

Patient information is available in patients.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/patients/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/patients.csv.gz | head
subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod
10000032,F,52,2180,2014 - 2016,2180-09-09
10000048,F,23,2126,2008 - 2010,
10000068,F,19,2160,2008 - 2010,
10000084,M,72,2160,2017 - 2019,2161-02-13
10000102,F,27,2136,2008 - 2010,
10000108,M,25,2163,2014 - 2016,
10000115,M,24,2154,2017 - 2019,
10000117,F,48,2174,2008 - 2010,
10000178,F,59,2157,2017 - 2019,

Q4.1 Ingestion

Import patients.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/patients/) as a tibble patients_tble.

Answer

patients_tble <- read_csv("~/mimic/hosp/patients.csv.gz")
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(patients_tble, 5)
# A tibble: 5 × 6
  subject_id gender anchor_age anchor_year anchor_year_group dod       
       <dbl> <chr>       <dbl>       <dbl> <chr>             <date>    
1   10000032 F              52        2180 2014 - 2016       2180-09-09
2   10000048 F              23        2126 2008 - 2010       NA        
3   10000068 F              19        2160 2008 - 2010       NA        
4   10000084 M              72        2160 2017 - 2019       2161-02-13
5   10000102 F              27        2136 2008 - 2010       NA        

Q4.2 Summary and visualization

Summarize variables gender and anchor_age by graphics, and explain any patterns you see.

#Distribution of Age
ggplot(data = patients_tble, aes(x = anchor_age))+
  geom_histogram(binwidth = 5, fill = "lightblue", color = "skyblue")+
  labs(title = "Age Distribution", x = "Age", y = "Count")+
  theme_minimal()

##Distribution of Gender

ggplot(data = patients_tble, aes(x = gender, fill = gender))+
  geom_bar() +
  labs(title = "Gender Distribution", x = "Gender", y = "Count")+
  theme_minimal()

#Boxplot of Age by Gender
ggplot(data = patients_tble, aes(x = gender, y = anchor_age, fill= "lightblue"))+
  geom_boxplot()+
  labs(title = "Anchor Age Distribution by Gender", x = "Gender", y = "Age")+
  theme_minimal()+
  scale_fill_manual(values = "skyblue", name = "Gender") +
  guides(fill = guide_legend(title = "Gender")) +
  theme(legend.position = "none")

Answer -Patterns The distribution of age shows that the majority of patients are younger, mostly in their 20s-30s. As the age increases, the number of patients decreases. The gender distribution shows that there are slightly more females. The boxplot reveals that the age distributions of males and females are the same, with very similar medians and IQRs.

Q5. Lab results

labevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are

zcat < ~/mimic/hosp/labevents.csv.gz | head
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,

d_labitems.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/d_labitems/) is the dictionary of lab measurements.

zcat < ~/mimic/hosp/d_labitems.csv.gz | head
itemid,label,fluid,category
50801,Alveolar-arterial Gradient,Blood,Blood Gas
50802,Base Excess,Blood,Blood Gas
50803,"Calculated Bicarbonate, Whole Blood",Blood,Blood Gas
50804,Calculated Total CO2,Blood,Blood Gas
50805,Carboxyhemoglobin,Blood,Blood Gas
50806,"Chloride, Whole Blood",Blood,Blood Gas
50808,Free Calcium,Blood,Blood Gas
50809,Glucose,Blood,Blood Gas
50810,"Hematocrit, Calculated",Blood,Blood Gas

We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz that only containing these items for the patients in icustays_tble. Further restrict to the last available measurement (by storetime) before the ICU stay. The final labevents_tble should have one row per ICU stay and columns for each lab measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq folder available at the current working directory hw3, for example, by a symbolic link.

#Read in dlabitems dataset
dlabitems <- read_csv("~/mimic/hosp/d_labitems.csv.gz")
Rows: 1622 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): label, fluid, category
dbl (1): itemid

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dlabitems,5)
# A tibble: 5 × 4
  itemid label                               fluid category 
   <dbl> <chr>                               <chr> <chr>    
1  50801 Alveolar-arterial Gradient          Blood Blood Gas
2  50802 Base Excess                         Blood Blood Gas
3  50803 Calculated Bicarbonate, Whole Blood Blood Blood Gas
4  50804 Calculated Total CO2                Blood Blood Gas
5  50805 Carboxyhemoglobin                   Blood Blood Gas

Answer: Q5 output table

#Q5
#reread just in case it was overwritten in previous question
icustays <- read_csv("~/mimic/icu/icustays.csv.gz") |> print(width = Inf)
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 73,181 × 8
   subject_id  hadm_id  stay_id first_careunit                                  
        <dbl>    <dbl>    <dbl> <chr>                                           
 1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
 2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
 3   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
 4   10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)             
 5   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
 7   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 8   10002155 20345487 32358465 Medical Intensive Care Unit (MICU)              
 9   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
10   10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
   last_careunit                                    intime             
   <chr>                                            <dttm>             
 1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 4 Surgical Intensive Care Unit (SICU)              2157-12-19 15:42:24
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
 8 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
 9 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-24 00:50:00
   outtime               los
   <dttm>              <dbl>
 1 2180-07-23 23:50:47 0.410
 2 2189-06-27 20:38:27 0.498
 3 2157-11-21 22:08:00 1.12 
 4 2157-12-20 14:27:41 0.948
 5 2110-04-12 23:59:56 1.34 
 6 2131-01-20 08:27:30 9.17 
 7 2160-05-19 17:33:33 1.31 
 8 2131-03-10 18:09:21 0.859
 9 2129-08-10 17:02:38 6.18 
10 2130-09-27 22:13:41 3.89 
# ℹ 73,171 more rows
itemid_list <- c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)
labevents_ds<- arrow::open_dataset("labevents_pq/part-0.parquet", format = "parquet") 

#Filter for the itemid_list and subject_id in icustays
labevents_filtered<- labevents_ds|>
  select(subject_id, itemid, valuenum, storetime) |>
  filter(itemid %in% itemid_list) |>
  filter(subject_id %in% icustays$subject_id) |>
  collect() |>
  print(width = Inf)
# A tibble: 13,473,870 × 4
   subject_id itemid valuenum storetime          
        <int>  <int>    <dbl> <dttm>             
 1   10000032  50882     27   2180-03-23 09:40:00
 2   10000032  50902    101   2180-03-23 09:40:00
 3   10000032  50912      0.4 2180-03-23 09:40:00
 4   10000032  50971      3.7 2180-03-23 09:40:00
 5   10000032  50983    136   2180-03-23 09:40:00
 6   10000032  50931     95   2180-03-23 08:56:00
 7   10000032  51221     45.4 2180-03-23 08:19:00
 8   10000032  51301      3   2180-03-23 08:19:00
 9   10000032  51221     42.6 2180-05-06 15:42:00
10   10000032  51301      5   2180-05-06 15:42:00
# ℹ 13,473,860 more rows
#Join with icustays to get stay_id
#Further restrict to the last available measurement (by storetime) 
labevents_final <- labevents_filtered |>
  left_join(icustays, by = "subject_id") |> #many to many relationship
  group_by(subject_id, stay_id, itemid) |> #need to group by to get the last measurement
   #restrict to the last available measurement (by storetime) before the ICU stay
  filter(storetime <= intime) |>
  arrange(storetime, .by_group = TRUE) |>
  slice_tail(n=1) |> #get the last measurement
  ungroup() |>
  select(subject_id, stay_id, itemid, valuenum) |>
  left_join(dlabitems, by = "itemid")
Warning in left_join(labevents_filtered, icustays, by = "subject_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 845 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
labevents_tble <- labevents_final |>
  select(subject_id, stay_id, valuenum, label) |>
  pivot_wider(names_from = label, values_from = valuenum)
labevents_tble
# A tibble: 68,468 × 10
   subject_id  stay_id Bicarbonate Chloride Creatinine Glucose Potassium Sodium
        <dbl>    <dbl>       <dbl>    <dbl>      <dbl>   <dbl>     <dbl>  <dbl>
 1   10000032 39553978          25       95        0.7     102       6.7    126
 2   10000980 39765666          21      109        2.3      89       3.9    144
 3   10001217 34592300          30      104        0.5      87       4.1    142
 4   10001217 37067082          22      108        0.6     112       4.2    142
 5   10001725 31205490          NA       98       NA        NA       4.1    139
 6   10001884 37510196          30       88        1.1     141       4.5    130
 7   10002013 39060235          24      102        0.9     288       3.5    137
 8   10002155 31090461          23       98        2.8     117       4.9    135
 9   10002155 32358465          26       85        1.4     133       5.7    120
10   10002155 33685454          24      105        1.1     138       4.6    139
# ℹ 68,458 more rows
# ℹ 2 more variables: Hematocrit <dbl>, `White Blood Cells` <dbl>
#Output Final Result Table
labevents_tble |> print(width = Inf)
# A tibble: 68,468 × 10
   subject_id  stay_id Bicarbonate Chloride Creatinine Glucose Potassium Sodium
        <dbl>    <dbl>       <dbl>    <dbl>      <dbl>   <dbl>     <dbl>  <dbl>
 1   10000032 39553978          25       95        0.7     102       6.7    126
 2   10000980 39765666          21      109        2.3      89       3.9    144
 3   10001217 34592300          30      104        0.5      87       4.1    142
 4   10001217 37067082          22      108        0.6     112       4.2    142
 5   10001725 31205490          NA       98       NA        NA       4.1    139
 6   10001884 37510196          30       88        1.1     141       4.5    130
 7   10002013 39060235          24      102        0.9     288       3.5    137
 8   10002155 31090461          23       98        2.8     117       4.9    135
 9   10002155 32358465          26       85        1.4     133       5.7    120
10   10002155 33685454          24      105        1.1     138       4.6    139
   Hematocrit `White Blood Cells`
        <dbl>               <dbl>
 1       41.1                 6.9
 2       27.3                 5.3
 3       37.4                 5.4
 4       38.1                15.7
 5       NA                  NA  
 6       39.7                12.2
 7       34.9                 7.2
 8       25.5                17.9
 9       22.4                 9.8
10       39.7                 7.9
# ℹ 68,458 more rows

Q6. Vitals from charted events

chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220180,59,59,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220180,55,55,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220181,62,62,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220210,20,20,insp/min,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220277,95,95,%,0

d_items.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

We are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble. Further restrict to the first vital measurement within the ICU stay. The final chartevents_tble should have one row per ICU stay and columns for each vital measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make chartevents_pq folder available at the current working directory, for example, by a symbolic link.

#Q6
#run again just in case it was overwritten in previous question
icustays <- read_csv("~/mimic/icu/icustays.csv.gz")
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
icustays |> print(width = Inf)
# A tibble: 73,181 × 8
   subject_id  hadm_id  stay_id first_careunit                                  
        <dbl>    <dbl>    <dbl> <chr>                                           
 1   10000032 29079034 39553978 Medical Intensive Care Unit (MICU)              
 2   10000980 26913865 39765666 Medical Intensive Care Unit (MICU)              
 3   10001217 24597018 37067082 Surgical Intensive Care Unit (SICU)             
 4   10001217 27703517 34592300 Surgical Intensive Care Unit (SICU)             
 5   10001725 25563031 31205490 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6   10001884 26184834 37510196 Medical Intensive Care Unit (MICU)              
 7   10002013 23581541 39060235 Cardiac Vascular Intensive Care Unit (CVICU)    
 8   10002155 20345487 32358465 Medical Intensive Care Unit (MICU)              
 9   10002155 23822395 33685454 Coronary Care Unit (CCU)                        
10   10002155 28994087 31090461 Medical/Surgical Intensive Care Unit (MICU/SICU)
   last_careunit                                    intime             
   <chr>                                            <dttm>             
 1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 4 Surgical Intensive Care Unit (SICU)              2157-12-19 15:42:24
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
 8 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
 9 Coronary Care Unit (CCU)                         2129-08-04 12:45:00
10 Medical/Surgical Intensive Care Unit (MICU/SICU) 2130-09-24 00:50:00
   outtime               los
   <dttm>              <dbl>
 1 2180-07-23 23:50:47 0.410
 2 2189-06-27 20:38:27 0.498
 3 2157-11-21 22:08:00 1.12 
 4 2157-12-20 14:27:41 0.948
 5 2110-04-12 23:59:56 1.34 
 6 2131-01-20 08:27:30 9.17 
 7 2160-05-19 17:33:33 1.31 
 8 2131-03-10 18:09:21 0.859
 9 2129-08-10 17:02:38 6.18 
10 2130-09-27 22:13:41 3.89 
# ℹ 73,171 more rows
itemid_list <- c(220045, 220179, 220180, 223761, 220210)
chartevents_tble <- arrow::open_dataset("chartevents_pq/part-0.parquet", format = "parquet") |> 
  filter(itemid %in% itemid_list) |> 
  #filter(subject_id %in% icustays$subject_id) |> 
  select(subject_id, itemid, value, charttime) |>
  collect() |> print(width = Inf)
# A tibble: 18,437,782 × 4
   subject_id itemid value charttime          
        <int>  <int> <chr> <dttm>             
 1   10022584 220179 104   2180-11-17 10:00:00
 2   10022584 220210 18    2180-11-17 10:00:00
 3   10022584 220045 53    2180-11-17 00:00:00
 4   10022584 220179 105   2180-11-17 00:00:00
 5   10022584 220210 12    2180-11-17 00:00:00
 6   10022584 223761 97    2180-11-17 00:00:00
 7   10022584 220045 68    2180-11-17 03:00:00
 8   10022584 220210 18    2180-11-17 03:00:00
 9   10022584 220179 120   2180-11-17 03:15:00
10   10022584 220045 62    2180-11-17 05:00:00
# ℹ 18,437,772 more rows
#note: my parquet from hw2 doesnt have valuenum so we need to convert value to numeric (won't need to if your parquet includes valuenum already)
chartevents_tble$valuenum = as.numeric(chartevents_tble$value)
chartevents_tble |> print(width = Inf)
# A tibble: 18,437,782 × 5
   subject_id itemid value charttime           valuenum
        <int>  <int> <chr> <dttm>                 <dbl>
 1   10022584 220179 104   2180-11-17 10:00:00      104
 2   10022584 220210 18    2180-11-17 10:00:00       18
 3   10022584 220045 53    2180-11-17 00:00:00       53
 4   10022584 220179 105   2180-11-17 00:00:00      105
 5   10022584 220210 12    2180-11-17 00:00:00       12
 6   10022584 223761 97    2180-11-17 00:00:00       97
 7   10022584 220045 68    2180-11-17 03:00:00       68
 8   10022584 220210 18    2180-11-17 03:00:00       18
 9   10022584 220179 120   2180-11-17 03:15:00      120
10   10022584 220045 62    2180-11-17 05:00:00       62
# ℹ 18,437,772 more rows
#lost charttime, need it for q7
chartevents_tble_copy <- chartevents_tble
#Read in d_items dataset
d_items.csv.gz <- read_csv("~/mimic/icu/d_items.csv.gz")
Rows: 4014 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): label, abbreviation, linksto, category, unitname, param_type
dbl (3): itemid, lownormalvalue, highnormalvalue

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_items.csv.gz |> print(width = Inf)
# A tibble: 4,014 × 9
   itemid label                               abbreviation       linksto       
    <dbl> <chr>                               <chr>              <chr>         
 1 220001 Problem List                        Problem List       chartevents   
 2 220003 ICU Admission date                  ICU Admission date datetimeevents
 3 220045 Heart Rate                          HR                 chartevents   
 4 220046 Heart rate Alarm - High             HR Alarm - High    chartevents   
 5 220047 Heart Rate Alarm - Low              HR Alarm - Low     chartevents   
 6 220048 Heart Rhythm                        Heart Rhythm       chartevents   
 7 220050 Arterial Blood Pressure systolic    ABPs               chartevents   
 8 220051 Arterial Blood Pressure diastolic   ABPd               chartevents   
 9 220052 Arterial Blood Pressure mean        ABPm               chartevents   
10 220056 Arterial Blood Pressure Alarm - Low ABP Alarm - Low    chartevents   
   category            unitname param_type    lownormalvalue highnormalvalue
   <chr>               <chr>    <chr>                  <dbl>           <dbl>
 1 General             <NA>     Text                      NA              NA
 2 ADT                 <NA>     Date and time             NA              NA
 3 Routine Vital Signs bpm      Numeric                   NA              NA
 4 Alarms              bpm      Numeric                   NA              NA
 5 Alarms              bpm      Numeric                   NA              NA
 6 Routine Vital Signs <NA>     Text                      NA              NA
 7 Routine Vital Signs mmHg     Numeric                   90             140
 8 Routine Vital Signs mmHg     Numeric                   60              90
 9 Routine Vital Signs mmHg     Numeric                   NA              NA
10 Alarms              mmHg     Numeric                   NA              NA
# ℹ 4,004 more rows
#convert time intime and outtime
icustays$intime <- as.POSIXct(icustays$intime, format = "%Y-%m-%d %H:%M:%S")
icustays$outtime <- as.POSIXct(icustays$outtime, format = "%Y-%m-%d %H:%M:%S")

#Note prof converts subject_id to integer in example output
icustays$subject_id <- as.integer(icustays$subject_id)
chartevents_tble$subject_id <- as.integer(chartevents_tble$subject_id)

#Join with icustays to get stay_id
chartevents_tble_filtered <- chartevents_tble |>
  inner_join(icustays, by = "subject_id") |>
  group_by(subject_id, stay_id,itemid) |> #note stay_id will get renamed after join
  #Further restrict to the first vital measurement within the ICU stay
  filter(charttime >= intime, charttime <= outtime) |>
  arrange(charttime, .by_group = TRUE) |>
  slice_head(n=1) |>
  ungroup() |>
  select(subject_id, stay_id, itemid, valuenum) |>
  #Join with d_items to get label
  left_join(d_items.csv.gz, by = "itemid")
Warning in inner_join(chartevents_tble, icustays, by = "subject_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 725 of `x` matches multiple rows in `y`.
ℹ Row 180 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
chartevents_tble_filtered
# A tibble: 290,267 × 12
   subject_id  stay_id itemid valuenum label       abbreviation linksto category
        <int>    <dbl>  <dbl>    <dbl> <chr>       <chr>        <chr>   <chr>   
 1   10000032 39553978 220045     91   Heart Rate  HR           charte… Routine…
 2   10000032 39553978 220179     84   Non Invasi… NBPs         charte… Routine…
 3   10000032 39553978 220210     24   Respirator… RR           charte… Respira…
 4   10000032 39553978 223761     98.7 Temperatur… Temperature… charte… Routine…
 5   10000980 39765666 220045     77   Heart Rate  HR           charte… Routine…
 6   10000980 39765666 220179    150   Non Invasi… NBPs         charte… Routine…
 7   10000980 39765666 220210     23   Respirator… RR           charte… Respira…
 8   10000980 39765666 223761     98   Temperatur… Temperature… charte… Routine…
 9   10001217 34592300 220045     96   Heart Rate  HR           charte… Routine…
10   10001217 34592300 220179    167   Non Invasi… NBPs         charte… Routine…
# ℹ 290,257 more rows
# ℹ 4 more variables: unitname <chr>, param_type <chr>, lownormalvalue <dbl>,
#   highnormalvalue <dbl>
chartevents_tble <- chartevents_tble_filtered |>
  select(subject_id, stay_id, valuenum, label) |>
  pivot_wider(names_from = label, values_from = valuenum)
  #select(subject_id, stay_id, itemid, valuenum) |>
  #pivot_wider(names_from = label, values_from = valuenum)
#Output Final Result Table
chartevents_tble
# A tibble: 73,164 × 6
   subject_id  stay_id `Heart Rate` Non Invasive Blood Pres…¹ `Respiratory Rate`
        <int>    <dbl>        <dbl>                     <dbl>              <dbl>
 1   10000032 39553978           91                        84                 24
 2   10000980 39765666           77                       150                 23
 3   10001217 34592300           96                       167                 11
 4   10001217 37067082           86                       151                 18
 5   10001725 31205490           55                        73                 19
 6   10001884 37510196           38                       180                 10
 7   10002013 39060235           80                       104                 14
 8   10002155 31090461           94                       118                 18
 9   10002155 32358465           98                       109                 23
10   10002155 33685454           68                       126                 18
# ℹ 73,154 more rows
# ℹ abbreviated name: ¹​`Non Invasive Blood Pressure systolic`
# ℹ 1 more variable: `Temperature Fahrenheit` <dbl>

Answer: Q6 output table

#add charttime back to the table (need this for q7)
chartevents_tble_copy <- chartevents_tble_copy |> 
  select(subject_id,charttime)

chartevents_tble <- chartevents_tble |>
 left_join(chartevents_tble_copy |> distinct(subject_id, .keep_all = TRUE), by = c("subject_id")) |>
 select(everything())
chartevents_tble
# A tibble: 73,164 × 7
   subject_id  stay_id `Heart Rate` Non Invasive Blood Pres…¹ `Respiratory Rate`
        <int>    <dbl>        <dbl>                     <dbl>              <dbl>
 1   10000032 39553978           91                        84                 24
 2   10000980 39765666           77                       150                 23
 3   10001217 34592300           96                       167                 11
 4   10001217 37067082           86                       151                 18
 5   10001725 31205490           55                        73                 19
 6   10001884 37510196           38                       180                 10
 7   10002013 39060235           80                       104                 14
 8   10002155 31090461           94                       118                 18
 9   10002155 32358465           98                       109                 23
10   10002155 33685454           68                       126                 18
# ℹ 73,154 more rows
# ℹ abbreviated name: ¹​`Non Invasive Blood Pressure systolic`
# ℹ 2 more variables: `Temperature Fahrenheit` <dbl>, charttime <dttm>

Q7. Putting things together

Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are all ICU stays of adults (age at intime >= 18) and columns contain at least following variables

  • all variables in icustays_tble
  • all variables in admissions_tble
  • all variables in patients_tble
  • the last lab measurements before the ICU stay in labevents_tble
  • the first vital measurements during the ICU stay in chartevents_tble

The final mimic_icu_cohort should have one row per ICU stay and columns for each variable.

Answer

#Q7
#merge all together
mimic_icu_cohort_merged <- icustays_tble |>
  left_join(admissions_tble, by = "subject_id") |>
  left_join(patients_tble, by = "subject_id") |>
  left_join(labevents_tble, by = "subject_id") |>
  left_join(chartevents_tble, by = "subject_id") |>
  #age at intime
  filter(anchor_age >= 18)
Warning in left_join(icustays_tble, admissions_tble, by = "subject_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 40 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
Warning in left_join(left_join(left_join(icustays_tble, admissions_tble, : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 12 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
Warning in left_join(left_join(left_join(left_join(icustays_tble, admissions_tble, : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 12 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
#first vital measurements during the ICU stay
mimic_icu_cohort <- mimic_icu_cohort_merged |>
  group_by(subject_id, stay_id)|>
  filter(charttime >= intime, charttime <= outtime) |>
  arrange(charttime, .by_group = TRUE) |>
  slice_head(n=1) |>
  ungroup() |>
  #rename hadm_id.x and stay_id.x to hadm_id and stay_id
  # #select all variables from icustays, admissions, patients
  #select(subject_id, stay_id, intime, outtime, admittime, dischtime)
  select(everything()) |>
  #select(subject_id, hadm_id.x, stay_id.x, first_careunit, intime, outtime, los, admittime, dischtime, deathtime, admission_type, admit_provider_id, admission_location, discharge_location, insurance, language, marital_status,race, edregtime, edouttime,hospital_expire_flag,gender,anchor_age,anchor_year,anchor_year_group,dod) |>
  select(-c(hadm_id.x, stay_id.x))

#Output Final Result Table
mimic_icu_cohort |> print(width = Inf) 
# A tibble: 69,484 × 41
   subject_id first_careunit                                  
        <dbl> <chr>                                           
 1   10000032 Medical Intensive Care Unit (MICU)              
 2   10000980 Medical Intensive Care Unit (MICU)              
 3   10001217 Surgical Intensive Care Unit (SICU)             
 4   10001217 Surgical Intensive Care Unit (SICU)             
 5   10001725 Medical/Surgical Intensive Care Unit (MICU/SICU)
 6   10001884 Medical Intensive Care Unit (MICU)              
 7   10002013 Cardiac Vascular Intensive Care Unit (CVICU)    
 8   10002155 Medical Intensive Care Unit (MICU)              
 9   10002155 Medical Intensive Care Unit (MICU)              
10   10002155 Medical Intensive Care Unit (MICU)              
   last_careunit                                    intime             
   <chr>                                            <dttm>             
 1 Medical Intensive Care Unit (MICU)               2180-07-23 14:00:00
 2 Medical Intensive Care Unit (MICU)               2189-06-27 08:42:00
 3 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 4 Surgical Intensive Care Unit (SICU)              2157-11-20 19:18:02
 5 Medical/Surgical Intensive Care Unit (MICU/SICU) 2110-04-11 15:52:22
 6 Medical Intensive Care Unit (MICU)               2131-01-11 04:20:05
 7 Cardiac Vascular Intensive Care Unit (CVICU)     2160-05-18 10:00:53
 8 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
 9 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
10 Medical Intensive Care Unit (MICU)               2131-03-09 21:33:00
   outtime               los hadm_id.y admittime           dischtime          
   <dttm>              <dbl>     <dbl> <dttm>              <dttm>             
 1 2180-07-23 23:50:47 0.410  22595853 2180-05-06 22:23:00 2180-05-07 17:15:00
 2 2189-06-27 20:38:27 0.498  20897796 2193-08-15 01:01:00 2193-08-17 15:07:00
 3 2157-11-21 22:08:00 1.12   24597018 2157-11-18 22:56:00 2157-11-25 18:00:00
 4 2157-11-21 22:08:00 1.12   24597018 2157-11-18 22:56:00 2157-11-25 18:00:00
 5 2110-04-12 23:59:56 1.34   25563031 2110-04-11 15:08:00 2110-04-14 15:00:00
 6 2131-01-20 08:27:30 9.17   21192799 2130-10-05 20:04:00 2130-10-06 15:05:00
 7 2160-05-19 17:33:33 1.31   21763296 2165-11-23 08:19:00 2165-11-26 15:40:00
 8 2131-03-10 18:09:21 0.859  20345487 2131-03-09 20:33:00 2131-03-10 01:55:00
 9 2131-03-10 18:09:21 0.859  20345487 2131-03-09 20:33:00 2131-03-10 01:55:00
10 2131-03-10 18:09:21 0.859  20345487 2131-03-09 20:33:00 2131-03-10 01:55:00
   deathtime           admission_type    admit_provider_id
   <dttm>              <chr>             <chr>            
 1 NA                  URGENT            P874LG           
 2 NA                  OBSERVATION ADMIT P77SO2           
 3 NA                  EW EMER.          P4645A           
 4 NA                  EW EMER.          P4645A           
 5 NA                  EW EMER.          P35SU0           
 6 NA                  EU OBSERVATION    P906TK           
 7 NA                  DIRECT EMER.      P072C5           
 8 2131-03-10 21:53:00 EW EMER.          P80515           
 9 2131-03-10 21:53:00 EW EMER.          P80515           
10 2131-03-10 21:53:00 EW EMER.          P80515           
   admission_location     discharge_location insurance language marital_status
   <chr>                  <chr>              <chr>     <chr>    <chr>         
 1 TRANSFER FROM HOSPITAL HOME               Other     ENGLISH  WIDOWED       
 2 WALK-IN/SELF REFERRAL  HOME HEALTH CARE   Other     ENGLISH  MARRIED       
 3 EMERGENCY ROOM         HOME HEALTH CARE   Other     ?        MARRIED       
 4 EMERGENCY ROOM         HOME HEALTH CARE   Other     ?        MARRIED       
 5 PACU                   HOME               Other     ENGLISH  MARRIED       
 6 EMERGENCY ROOM         <NA>               Medicare  ENGLISH  MARRIED       
 7 CLINIC REFERRAL        HOME HEALTH CARE   Other     ENGLISH  SINGLE        
 8 EMERGENCY ROOM         DIED               Other     ENGLISH  MARRIED       
 9 EMERGENCY ROOM         DIED               Other     ENGLISH  MARRIED       
10 EMERGENCY ROOM         DIED               Other     ENGLISH  MARRIED       
   race                   edregtime           edouttime          
   <chr>                  <dttm>              <dttm>             
 1 WHITE                  2180-05-06 19:17:00 2180-05-06 23:30:00
 2 BLACK/AFRICAN AMERICAN 2193-08-14 21:25:00 2193-08-15 02:22:00
 3 WHITE                  2157-11-18 17:38:00 2157-11-19 01:24:00
 4 WHITE                  2157-11-18 17:38:00 2157-11-19 01:24:00
 5 WHITE                  NA                  NA                 
 6 BLACK/AFRICAN AMERICAN 2130-10-05 11:58:00 2130-10-06 15:05:00
 7 WHITE                  2165-11-22 20:54:00 2165-11-23 16:04:00
 8 WHITE                  2131-03-09 19:14:00 2131-03-09 21:33:00
 9 WHITE                  2131-03-09 19:14:00 2131-03-09 21:33:00
10 WHITE                  2131-03-09 19:14:00 2131-03-09 21:33:00
   hospital_expire_flag gender anchor_age anchor_year anchor_year_group
                  <dbl> <chr>       <dbl>       <dbl> <chr>            
 1                    0 F              52        2180 2014 - 2016      
 2                    0 F              73        2186 2008 - 2010      
 3                    0 F              55        2157 2011 - 2013      
 4                    0 F              55        2157 2011 - 2013      
 5                    0 F              46        2110 2011 - 2013      
 6                    0 F              68        2122 2008 - 2010      
 7                    0 F              53        2156 2008 - 2010      
 8                    1 F              80        2128 2008 - 2010      
 9                    1 F              80        2128 2008 - 2010      
10                    1 F              80        2128 2008 - 2010      
   dod        stay_id.y Bicarbonate Chloride Creatinine Glucose Potassium Sodium
   <date>         <dbl>       <dbl>    <dbl>      <dbl>   <dbl>     <dbl>  <dbl>
 1 2180-09-09  39553978          25       95        0.7     102       6.7    126
 2 2193-08-26  39765666          21      109        2.3      89       3.9    144
 3 NA          34592300          30      104        0.5      87       4.1    142
 4 NA          34592300          30      104        0.5      87       4.1    142
 5 NA          31205490          NA       98       NA        NA       4.1    139
 6 2131-01-20  37510196          30       88        1.1     141       4.5    130
 7 NA          39060235          24      102        0.9     288       3.5    137
 8 2131-03-10  31090461          23       98        2.8     117       4.9    135
 9 2131-03-10  31090461          23       98        2.8     117       4.9    135
10 2131-03-10  31090461          23       98        2.8     117       4.9    135
   Hematocrit `White Blood Cells`  stay_id `Heart Rate`
        <dbl>               <dbl>    <dbl>        <dbl>
 1       41.1                 6.9 39553978           91
 2       27.3                 5.3 39765666           77
 3       37.4                 5.4 34592300           96
 4       37.4                 5.4 37067082           86
 5       NA                  NA   31205490           55
 6       39.7                12.2 37510196           38
 7       34.9                 7.2 39060235           80
 8       25.5                17.9 31090461           94
 9       25.5                17.9 32358465           98
10       25.5                17.9 33685454           68
   `Non Invasive Blood Pressure systolic` `Respiratory Rate`
                                    <dbl>              <dbl>
 1                                     84                 24
 2                                    150                 23
 3                                    167                 11
 4                                    151                 18
 5                                     73                 19
 6                                    180                 10
 7                                    104                 14
 8                                    118                 18
 9                                    109                 23
10                                    126                 18
   `Temperature Fahrenheit` charttime          
                      <dbl> <dttm>             
 1                     98.7 2180-07-23 14:01:00
 2                     98   2189-06-27 01:54:00
 3                     97.6 2157-11-21 11:00:00
 4                     98.5 2157-11-21 11:00:00
 5                     97.7 2110-04-12 12:00:00
 6                     98.1 2131-01-18 11:00:00
 7                     97.2 2160-05-18 13:00:00
 8                     96.9 2131-03-09 13:45:00
 9                     97.7 2131-03-09 13:45:00
10                     95.9 2131-03-09 13:45:00
# ℹ 69,474 more rows

Q8. Exploratory data analysis (EDA)

Summarize the following information about the ICU stay cohort mimic_icu_cohort using appropriate numerics or graphs:

  • Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)
summary_stats <- mimic_icu_cohort %>%
  group_by(race, insurance, marital_status, gender) %>%
  summarize(
    mean_los = mean(los),
    median_los = median(los),
    sd_los = sd(los)
  )
`summarise()` has grouped output by 'race', 'insurance', 'marital_status'. You
can override using the `.groups` argument.
#print summary stats mean, median, sd
summary_stats |> print(width = Inf)
# A tibble: 719 × 7
# Groups:   race, insurance, marital_status [415]
   race                          insurance marital_status gender mean_los
   <chr>                         <chr>     <chr>          <chr>     <dbl>
 1 AMERICAN INDIAN/ALASKA NATIVE Medicaid  DIVORCED       F         0.834
 2 AMERICAN INDIAN/ALASKA NATIVE Medicaid  MARRIED        F         6.21 
 3 AMERICAN INDIAN/ALASKA NATIVE Medicaid  MARRIED        M        21.4  
 4 AMERICAN INDIAN/ALASKA NATIVE Medicaid  SINGLE         F         1.33 
 5 AMERICAN INDIAN/ALASKA NATIVE Medicaid  SINGLE         M         1.19 
 6 AMERICAN INDIAN/ALASKA NATIVE Medicaid  <NA>           F        16.9  
 7 AMERICAN INDIAN/ALASKA NATIVE Medicare  DIVORCED       F         1.00 
 8 AMERICAN INDIAN/ALASKA NATIVE Medicare  DIVORCED       M         5.54 
 9 AMERICAN INDIAN/ALASKA NATIVE Medicare  MARRIED        F        11.0  
10 AMERICAN INDIAN/ALASKA NATIVE Medicare  MARRIED        M         3.00 
   median_los sd_los
        <dbl>  <dbl>
 1      0.834  0    
 2      6.21  NA    
 3     21.4    0    
 4      1.33   0.570
 5      1.19   0    
 6     16.9   NA    
 7      0.913  0.156
 8      8.53   3.86 
 9     13.1    7.70 
10      3.26   0.948
# ℹ 709 more rows
#Race Distribution
ggplot(data = summary_stats, aes(x = race, fill= "mediumpurple"))+
  geom_bar()+
  labs(title = "Distribution of Race", x = "Race", y = "Count")+
  theme_minimal()+
  scale_fill_manual(values = "mediumpurple") +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#Average LOS by Race
ggplot(summary_stats, aes(x = race, y = mean_los)) +
  geom_bar(stat = "identity", fill = "mediumpurple") +
  labs(x = "Race", y = "Average Length of ICU Stay (LOS)") +
  ggtitle("Average Length of Stay(LOS) by Race") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#Distribution of Insurance
ggplot(data = summary_stats, aes(x = insurance, fill= "lightyellow"))+
  geom_bar()+
  labs(title = "Distribution of Insurance", x = "Insurance Type", y = "Count")+
  theme_minimal()+
  scale_fill_manual(values = "lightyellow") +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Create barplot of average los by insurance
ggplot(summary_stats, aes(x = insurance, y = mean_los)) +
  geom_bar(stat = "identity", fill = "lightyellow") +
  labs(x = "Insurance Type", y = "Average Length of ICU Stay (LOS)") +
  ggtitle("Average Length of Stay(LOS) by Insurance") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#marital_status
ggplot(data = summary_stats, aes(x = marital_status, fill= "lightblue"))+
  geom_bar()+
  labs(title = "Distribution of Marital Status", x = "Marital Status", y = "Count")+
  theme_minimal()+
  scale_fill_manual(values = "lightgreen") +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#LOS vs marital status
ggplot(summary_stats, aes(x = marital_status, y = mean_los)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(x = "Marital Status", y = "Average Length of ICU Stay (LOS)") +
  ggtitle("Average Length of Stay(LOS) by Marital Status") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#LOS vs gender Baplots
ggplot(summary_stats, aes(x = gender, y = mean_los)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(x = "Marital Status", y = "Average Length of ICU Stay (LOS)") +
  ggtitle("Average Length of Stay(LOS) by Gender") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#two histogram distributions overlapping (by gender)
ggplot(summary_stats, aes(x = mean_los, fill = gender)) +
  geom_histogram(binwidth = 5, position = "dodge", alpha = 0.7) +
  facet_wrap(~ gender, scales = "free") +
  labs(x = "Average Length of ICU Stay (LOS)", y = "Count") +
  ggtitle("Average Length of Stay (LOS) Distribution by Gender") +
  theme_minimal()

#Boxplot of LOS by Gender
ggplot(data = mimic_icu_cohort, aes(x = gender, y = los, fill= "lightblue"))+
  geom_boxplot()+
  labs(title = "Length of Stay (LOS) Distribution by Gender", x = "Gender", y = "Length of Stay (LOS)")+
  theme_minimal()+
  scale_fill_manual(values = "skyblue", name = "Gender") +
  guides(fill = guide_legend(title = "Gender")) +
  theme(legend.position = "none") +
  scale_y_log10()  # Log scale for y-axis

#the distribution seems to be heavily skewed, so I applied a log scale for the y-axis
#age & los scatterplot
#Age at intime vs Length of ICU stay
mimic_icu_cohort |>
  ggplot(aes(x = anchor_age, y = los)) +
  geom_point(aes(color=mimic_icu_cohort$anchor_age)) +
  labs(title = "Length of ICU Stay vs Age", x = "Length of ICU stay", y = "Age at Intime")
Warning: Use of `mimic_icu_cohort$anchor_age` is discouraged.
ℹ Use `anchor_age` instead.

#change legend title
#age & los by gender scatterplot
# Create scatterplot of los vs age at intime
ggplot(mimic_icu_cohort, aes(x = anchor_age, y = los, color = gender)) +
  geom_point() +
  labs(x = "Age at Intime", y = "Length of ICU Stay (LOS)") +
  ggtitle("Scatterplot of LOS vs Age at Intime") +
  scale_color_manual(values = c("lightblue", "lightpink"))  # Custom color for gender

  • Length of ICU stay los vs the last available lab measurements before ICU stay
#select lab measurements columns
lab_measurements <- mimic_icu_cohort |>
  select(subject_id, stay_id, Bicarbonate, Chloride, Creatinine, Glucose, Hematocrit, Potassium, Sodium, "White Blood Cells", los, charttime, intime) 
#before icu stay
 lab_measurements <- lab_measurements |> 
   group_by(subject_id, stay_id) |>
   filter(charttime <= intime) |> 
   arrange(charttime, .by_group = TRUE) |>
   slice_tail(n=1) |>
   print(width = Inf)
# A tibble: 183 × 13
# Groups:   subject_id, stay_id [183]
   subject_id  stay_id Bicarbonate Chloride Creatinine Glucose Hematocrit
        <dbl>    <dbl>       <dbl>    <dbl>      <dbl>   <dbl>      <dbl>
 1   10204703 39209965          21      109        0.7      80       27.8
 2   10274736 36064661          19       99        0.7     187       45.7
 3   10298415 34197081          NA       NA        1.2      NA       36.7
 4   10549680 33186250          21      105        1.1      97       45.3
 5   10591033 37468075          24       99        1.3     430       40  
 6   10591889 39647436          28      103        2.5      93       37.3
 7   10621049 38936957          36       97        0.7     113       52.3
 8   11050108 37420759          NA       NA        0.5      NA       38.6
 9   11094262 37772362          25      102        1.3     157       38.9
10   11219612 37996417          NA      100        0.9      NA       43.1
   Potassium Sodium `White Blood Cells`   los charttime          
       <dbl>  <dbl>               <dbl> <dbl> <dttm>             
 1       3.9    139                 8.3 0.471 2178-01-08 19:30:00
 2       4.6    137                13.7 1.02  2175-03-19 17:01:00
 3      NA       NA                20.3 1.79  2120-06-19 18:10:00
 4       3.2    142                11.6 0.493 2168-01-02 16:55:00
 5       4.4    139                16.4 1.46  2122-03-03 22:00:00
 6       4.1    149                 5.1 1.51  2183-03-22 23:07:00
 7       5.6    142                 8.5 5.58  2129-12-21 18:51:00
 8      NA       NA                13.3 0.912 2188-11-01 18:59:00
 9       5      138                 8.2 0.953 2164-07-31 12:12:00
10       3.8    139                 9.9 0.718 2133-10-11 18:30:00
   intime             
   <dttm>             
 1 2178-01-09 03:30:00
 2 2175-03-20 00:01:00
 3 2120-06-20 01:10:00
 4 2168-01-03 00:55:00
 5 2122-03-04 06:00:00
 6 2183-03-23 06:07:00
 7 2129-12-22 02:51:00
 8 2188-11-02 01:59:00
 9 2164-07-31 19:12:00
10 2133-10-12 01:30:00
# ℹ 173 more rows
 #only include columns with lab measurements and los
 lab_measurements_filtered <- lab_measurements |>
   select(-c(charttime, intime)) |>
   pivot_longer(cols = -c(subject_id, stay_id, los), names_to = "lab", values_to = "value") |>
   print(width = Inf)
# A tibble: 1,464 × 5
# Groups:   subject_id, stay_id [183]
   subject_id  stay_id   los lab               value
        <dbl>    <dbl> <dbl> <chr>             <dbl>
 1   10204703 39209965 0.471 Bicarbonate        21  
 2   10204703 39209965 0.471 Chloride          109  
 3   10204703 39209965 0.471 Creatinine          0.7
 4   10204703 39209965 0.471 Glucose            80  
 5   10204703 39209965 0.471 Hematocrit         27.8
 6   10204703 39209965 0.471 Potassium           3.9
 7   10204703 39209965 0.471 Sodium            139  
 8   10204703 39209965 0.471 White Blood Cells   8.3
 9   10274736 36064661 1.02  Bicarbonate        19  
10   10274736 36064661 1.02  Chloride           99  
# ℹ 1,454 more rows
 ggplot(lab_measurements_filtered, aes(x = value, y = los)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~lab, scales = "free") +
  labs(x = "Lab Measurement", y = "Length of ICU Stay (LOS)") +
  ggtitle("Length of ICU Stay vs Last Available Lab Measurements Before ICU Stay")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: Removed 89 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 89 rows containing missing values (`geom_point()`).

  • Length of ICU stay los vs the average vital measurements within the first hour of ICU stay
#to see the columns
#colnames(mimic_icu_cohort)

vital_measurements <- mimic_icu_cohort |>
  select(subject_id, stay_id, "Heart Rate", "Respiratory Rate", "Non Invasive Blood Pressure systolic", "Temperature Fahrenheit", los, intime, outtime, charttime)
vital_measurements
# A tibble: 69,484 × 10
   subject_id  stay_id `Heart Rate` `Respiratory Rate` Non Invasive Blood Pres…¹
        <dbl>    <dbl>        <dbl>              <dbl>                     <dbl>
 1   10000032 39553978           91                 24                        84
 2   10000980 39765666           77                 23                       150
 3   10001217 34592300           96                 11                       167
 4   10001217 37067082           86                 18                       151
 5   10001725 31205490           55                 19                        73
 6   10001884 37510196           38                 10                       180
 7   10002013 39060235           80                 14                       104
 8   10002155 31090461           94                 18                       118
 9   10002155 32358465           98                 23                       109
10   10002155 33685454           68                 18                       126
# ℹ 69,474 more rows
# ℹ abbreviated name: ¹​`Non Invasive Blood Pressure systolic`
# ℹ 5 more variables: `Temperature Fahrenheit` <dbl>, los <dbl>, intime <dttm>,
#   outtime <dttm>, charttime <dttm>
#within first hour of ICU stay
 vital_measurements <- vital_measurements |> 
   group_by(subject_id, stay_id) |>
   filter(charttime >= intime, charttime <= intime + 3600) |>
   arrange(charttime, .by_group = TRUE) |>
   slice_tail(n=1) |>
   print(width = Inf)
# A tibble: 12,547 × 10
# Groups:   subject_id, stay_id [12,547]
   subject_id  stay_id `Heart Rate` `Respiratory Rate`
        <dbl>    <dbl>        <dbl>              <dbl>
 1   10000980 39765666           77                 23
 2   10002155 31090461           94                 18
 3   10002155 32358465           98                 23
 4   10002155 33685454           68                 18
 5   10002348 32610785           72                 14
 6   10002930 35629889          103                 22
 7   10002930 37049133           87                 20
 8   10003019 30676350           81                 20
 9   10003400 32128372          117                 15
10   10003400 34577403           97                 13
   `Non Invasive Blood Pressure systolic` `Temperature Fahrenheit`    los
                                    <dbl>                    <dbl>  <dbl>
 1                                    150                     98    0.498
 2                                    118                     96.9  0.859
 3                                    109                     97.7  0.859
 4                                    126                     95.9  0.859
 5                                    129                     97.9  9.79 
 6                                    138                     98.2  1.14 
 7                                    133                     99.8  1.14 
 8                                    124                     97.6  0.709
 9                                     58                     95.6 12.9  
10                                    121                     98   12.9  
   intime              outtime             charttime          
   <dttm>              <dttm>              <dttm>             
 1 2189-06-27 08:42:00 2189-06-27 20:38:27 2189-06-27 01:54:00
 2 2131-03-09 21:33:00 2131-03-10 18:09:21 2131-03-09 13:45:00
 3 2131-03-09 21:33:00 2131-03-10 18:09:21 2131-03-09 13:45:00
 4 2131-03-09 21:33:00 2131-03-10 18:09:21 2131-03-09 13:45:00
 5 2112-11-30 23:24:00 2112-12-10 18:25:13 2112-11-30 15:31:00
 6 2196-04-14 13:40:00 2196-04-15 16:54:44 2196-04-14 07:00:00
 7 2196-04-14 13:40:00 2196-04-15 16:54:44 2196-04-14 07:00:00
 8 2175-10-08 18:58:00 2175-10-09 11:59:16 2175-10-08 12:00:00
 9 2137-02-25 23:37:19 2137-03-10 21:29:36 2137-02-25 16:00:00
10 2137-02-25 23:37:19 2137-03-10 21:29:36 2137-02-25 16:00:00
# ℹ 12,537 more rows
vital_measurements
# A tibble: 12,547 × 10
# Groups:   subject_id, stay_id [12,547]
   subject_id  stay_id `Heart Rate` `Respiratory Rate` Non Invasive Blood Pres…¹
        <dbl>    <dbl>        <dbl>              <dbl>                     <dbl>
 1   10000980 39765666           77                 23                       150
 2   10002155 31090461           94                 18                       118
 3   10002155 32358465           98                 23                       109
 4   10002155 33685454           68                 18                       126
 5   10002348 32610785           72                 14                       129
 6   10002930 35629889          103                 22                       138
 7   10002930 37049133           87                 20                       133
 8   10003019 30676350           81                 20                       124
 9   10003400 32128372          117                 15                        58
10   10003400 34577403           97                 13                       121
# ℹ 12,537 more rows
# ℹ abbreviated name: ¹​`Non Invasive Blood Pressure systolic`
# ℹ 5 more variables: `Temperature Fahrenheit` <dbl>, los <dbl>, intime <dttm>,
#   outtime <dttm>, charttime <dttm>
 #only include columns with vital measurements and los
 vital_measurements_filtered <- vital_measurements |>
   select(-c(charttime, intime, outtime)) |>
   pivot_longer(cols = -c(subject_id, stay_id, los), names_to = "vital", values_to = "value") |>
   print(width = Inf)
# A tibble: 50,188 × 5
# Groups:   subject_id, stay_id [12,547]
   subject_id  stay_id   los vital                                value
        <dbl>    <dbl> <dbl> <chr>                                <dbl>
 1   10000980 39765666 0.498 Heart Rate                            77  
 2   10000980 39765666 0.498 Respiratory Rate                      23  
 3   10000980 39765666 0.498 Non Invasive Blood Pressure systolic 150  
 4   10000980 39765666 0.498 Temperature Fahrenheit                98  
 5   10002155 31090461 0.859 Heart Rate                            94  
 6   10002155 31090461 0.859 Respiratory Rate                      18  
 7   10002155 31090461 0.859 Non Invasive Blood Pressure systolic 118  
 8   10002155 31090461 0.859 Temperature Fahrenheit                96.9
 9   10002155 32358465 0.859 Heart Rate                            98  
10   10002155 32358465 0.859 Respiratory Rate                      23  
# ℹ 50,178 more rows
 ggplot(vital_measurements_filtered, aes(x = value, y = los)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~vital, scales = "free") +
  labs(x = "Lab Measurement", y = "Length of ICU Stay (LOS)") +
  ggtitle("Length of ICU Stay vs Last Available Lab Measurements Before ICU Stay")
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Warning: Removed 576 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 576 rows containing missing values (`geom_point()`).

  • Length of ICU stay los vs first ICU unit
#Distribution of First Care Unit
mimic_icu_cohort |>
  ggplot(aes(x = first_careunit, fill = "")) +
  geom_bar() +
  labs(title = "Distribution of First Care Unit", x = "First Care Unit", y = "Count")  +
  scale_fill_manual(values = "lightpink") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(legend.position = "none")

# Create boxplot of los vs first_careunit
ggplot(mimic_icu_cohort, aes(x = first_careunit, y = los)) +
  geom_boxplot() +
  labs(x = "First Care Unit", y = "Length of ICU Stay (LOS)") +
  ggtitle("Boxplot of LOS by First Care Unit") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_log10()  # Log scale for y-axis

#the distribution seems to be heavily skewed, so I applied a log scale for the y-axis